Abstract


The  objective  of  this  project  is  to  develop  scalable  parallel  formulations  of  the  key  computational 
kernels  used  in  scientific  simulations.  The  specific  problems  investigated  in  this  project  are  fast  and 
high  quality  graph  partitioners,  highly  parallel  direct  solvers,  and  parallel  formulations  of  robust 
preconditioners  for  iterative  solvers,  as  well  as  parallel  formulations  of  particle  simulation  techniques.
We  have  developed  fast,  high-quality  parallel  formulations  of  our  multilevel  graph  partitioning
algorithm  that  can  partition  very  large  graphs  quickly  on  parallel  computers,  making  it
feasible  to  perform  frequent  repartitioning  of  the  adaptive  and  unstructured  mesh  in  adaptive  FEM 
computations.  We  have  developed  massively  parallel  formulations  of  particle  simulation  techniques 
such  as  Fast  Multipole  and  Barnes-Hut  methods,  and  have  investigated  the  use  of  these  formulations
for  solving  dense  linear  systems  arising  in  the  boundary  element  solution  of  integral  equations.  We  have
developed  an  MPI-based  portable  library,  called  PSPASES,  that  has  been  used  to  solve  some  of  the 
largest  sparse  linear  systems  that  have  been  solved  using  direct  methods.  We  have  also  developed 
robust  and  parallel  preconditioners  for  iterative  solvers  using  our  fast  graph  partitioning  technique 
and  highly  parallel  Cholesky  factorization. 
1  Problem  Statement

Virtually  all  scientific  and  natural  phenomena  can  be  modeled  as  systems  of  differential  equations 
that  are  solved  using  finite  element  and  finite  difference  methods.  The  objective  of  this  project  is 
to  solve  linear  systems  arising  from  these  methods.  These  sparse  linear  systems  are  too  large  to 
be  solved  cost-effectively  on  traditional  vector  supercomputers.  This  project  aims  at  developing
highly  parallel  linear  system  solvers  and  investigating  their  applications  in  problems  of  interest
to  the  US  Army.  This  work  has  considerable  significance  since  it  will  enable  modeling  accuracies  and
discretizations  much  finer  than  currently  possible.  It  will  also  result  in  robust  and  portable  software 
that  can  be  used  for  a  variety  of  applications. 

2  Summary  of  Important  Results 

We  have  developed  highly  efficient  parallel  formulations  of  our  multilevel  graph  partitioning
algorithm  that  achieve  a  high  degree  of  concurrency  while  maintaining  the  high-quality  partitions
produced  by  the  serial  multilevel  algorithm.  Our  parallel  graph  partitioning  algorithm  can
partition  very  large  graphs  (e.g.,  over  8  million  vertices)  into  128  parts  on  a  128-processor  Cray  T3D
in  under  10  seconds.  Thus,  our  parallel  algorithm  makes  it  possible  to  perform  frequent  mesh
repartitioning  in  adaptive  computations  without  compromising  quality.
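
Partition quality in this setting is usually judged by two numbers: the edge cut (edges whose endpoints land in different parts) and the load balance. The following is a minimal sketch of both measures on a tiny hypothetical mesh graph. The function names, the example graph, and the two candidate partitions are illustrative assumptions and are not taken from the project's software.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints are assigned to different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def imbalance(part, nparts):
    """Largest part size divided by the ideal size; 1.0 means perfectly balanced."""
    sizes = [0] * nparts
    for p in part:
        sizes[p] += 1
    return max(sizes) / (len(part) / nparts)


# Hypothetical example: a 2 x 4 grid mesh, vertices numbered row by row.
#   0 - 1 - 2 - 3
#   |   |   |   |
#   4 - 5 - 6 - 7
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7),
         (0, 4), (1, 5), (2, 6), (3, 7)]
left_right = [0, 0, 1, 1, 0, 0, 1, 1]   # split between columns 1 and 2
top_bottom = [0, 0, 0, 0, 1, 1, 1, 1]   # split between the two rows

print(edge_cut(edges, left_right), imbalance(left_right, 2))
print(edge_cut(edges, top_bottom), imbalance(top_bottom, 2))
```

Both candidate partitions are perfectly balanced, but the left/right split cuts 2 edges while the top/bottom split cuts 4, so the former would incur less communication in a parallel computation.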

In  adaptive  finite  element  computation,  dynamic  adjustments  to  the  mesh  require  repartitioning
of  the  mesh  to  improve  load  balance.  This  repartitioning  also  results  in  movement  of  data
structures  associated  with  graph  vertices.  Hence,  a  good  repartitioning  algorithm  should  minimize
the  movement  of  vertices  (in  addition  to  balancing  the  load  and  minimizing  the  cut  of  the  resulting
new  partition).  In  this  project,  we  have  developed  parallel  graph  partitioners  that  minimize
the  movement  of  data  (in  addition  to  minimizing  the  edge  cut  of  the  partitioning)  when  a  mesh
is  adaptively  refined.  These  partitioners  will  also  facilitate  the  development  of  highly  parallel  mesh
generators,  as  mesh  generation  algorithms  also  build  the  mesh  by  adaptively  refining  it  at
various  places.
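
To make the data-movement objective concrete, the sketch below counts how many vertices change owner between an old and a new partition, and shows one simple way to reduce that count: relabeling the new parts so that each keeps the label of the old part it overlaps most. This greedy relabeling is an illustrative assumption, not necessarily the technique used in this project, and the names `migration` and `remap_labels` are hypothetical.

```python
def migration(old, new):
    """Number of vertices whose owning part differs between two partitions."""
    return sum(1 for a, b in zip(old, new) if a != b)


def remap_labels(old, new, nparts):
    """Greedily rename the parts of `new` to maximize overlap with `old`.

    Renaming parts changes neither the edge cut nor the balance; it only
    changes which processor ends up owning each part.
    """
    # overlap[i][j] = vertices in new part i that currently live in old part j
    overlap = [[0] * nparts for _ in range(nparts)]
    for o, n in zip(old, new):
        overlap[n][o] += 1
    pairs = sorted(((overlap[i][j], i, j)
                    for i in range(nparts) for j in range(nparts)),
                   reverse=True)
    label, used = {}, set()
    for _, i, j in pairs:
        if i not in label and j not in used:
            label[i] = j
            used.add(j)
    return [label[n] for n in new]


# Hypothetical example: 8 vertices, 2 parts.
old = [0, 0, 0, 0, 1, 1, 1, 1]
new = [1, 1, 1, 0, 0, 0, 0, 1]          # good partition, but labels are swapped
remapped = remap_labels(old, new, 2)

print(migration(old, new), migration(old, remapped))
```

In this example the raw new partition would move 6 of the 8 vertices, while the relabeled one moves only 2, with the same cut and balance.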

Factorization  algorithms  based  on  threshold  incomplete  LU  factorization  (ILUT)  have  been 
found  to  be  quite  effective  in  preconditioning  iterative  system  solvers.  However,  because  these 
factorizations  allow  the  fill  elements  to  be  created  dynamically,  their  parallel  formulations  had  not 
been  well  understood,  and  they  had  been  considered  to  be  unsuitable  for  distributed  memory  parallel
computers.  We  have  developed  a  highly  parallel  formulation  of  the  ILUT  factorization  algorithm
for  distributed  memory  parallel  computers.  This  algorithm  uses  our  graph  partitioning  algorithm 
in  conjunction  with  a  parallel  maximal  independent  subset  algorithm  to  effectively  parallelize  both
the  factorization  and  the  solution  of  the  resulting  triangular  systems.  Our  experiments  have
shown  that  both  the  ILUT  factorization  and  the  solution  of  the  resulting  triangular  systems
can  be  performed  very  fast  on  distributed  memory  parallel  computers.  Furthermore,  our  experiments
using  the  GMRES  iterative  solver  show  that  the  amount  of  time  spent  in  computing  the
factorization  is  usually  much  less  than  the  amount  of  time  required  to  solve  the  systems.
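
The independent-set idea can be illustrated in a few lines: unknowns that form an independent set in the matrix graph are not coupled to one another, so their rows can be eliminated concurrently. The report does not state which independent set algorithm is used; the sketch below assumes a Luby-style randomized scheme and simulates its rounds serially. The function name and the example graph are hypothetical.

```python
import random


def luby_mis(adj, seed=0):
    """Return a maximal independent set of the graph `adj` (vertex -> set of neighbors).

    In each round every remaining vertex draws a random key; a vertex joins the
    set if its key is smaller than the keys of all remaining neighbors. Winners
    and their neighbors are then removed. Each round could run in parallel.
    """
    rng = random.Random(seed)
    candidates = set(adj)
    independent = set()
    while candidates:
        key = {v: rng.random() for v in candidates}
        winners = {v for v in candidates
                   if all((key[v], v) < (key[u], u)
                          for u in adj[v] if u in candidates)}
        independent |= winners
        removed = set(winners)
        for v in winners:
            removed.update(adj[v])
        candidates -= removed
    return independent


# Hypothetical example: a ring of 6 unknowns, each coupled to its two neighbors.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
mis = luby_mis(ring)
print(sorted(mis))
```

Two adjacent vertices can never both win a round, which gives independence, and a vertex is only discarded once it or one of its neighbors is in the set, which gives maximality.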

We  have  developed  massively  parallel  formulations  of  particle  simulation  techniques  such  as 
Fast  Multipole  and  Barnes-Hut  methods,  and  have  used  these  formulations  for  solving  dense  linear
systems  arising  in  the  boundary  element  solution  of  integral  equations.  For  a  problem  with  200,000
unknowns,  we  demonstrate  over  two  orders  of  magnitude  speedup  from  parallelization  and  another 
two  orders  of  magnitude  from  approximation  in  our  preliminary  implementation. 
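
The speedup "from approximation" comes from replacing the interaction with a distant cluster of particles by a single interaction with the cluster's aggregate. Below is a minimal serial sketch of the Barnes-Hut idea under simplifying assumptions: particles on a line, a 1/r kernel, and a monopole (center of mass) approximation. The class and function names and the opening parameter `theta` are illustrative choices; the project's contribution, the parallel formulation, is not shown.

```python
import random


class Node:
    """Binary tree node covering the interval [lo, hi) of a 1D particle set."""

    def __init__(self, pts, lo, hi):
        self.pts = pts                      # list of (position, mass)
        self.size = hi - lo
        self.mass = sum(m for _, m in pts)
        self.center = sum(x * m for x, m in pts) / self.mass
        self.children = []
        # Split until one particle remains; the size guard stops runaway
        # recursion if two particles coincide.
        if len(pts) > 1 and self.size > 1e-12:
            mid = (lo + hi) / 2
            halves = (([p for p in pts if p[0] < mid], lo, mid),
                      ([p for p in pts if p[0] >= mid], mid, hi))
            self.children = [Node(h, a, b) for h, a, b in halves if h]


def potential(node, x, theta):
    """Approximate sum of m_j / |x - x_j| over the node, skipping a particle at x."""
    if not node.children:
        return sum(m / abs(x - y) for y, m in node.pts if y != x)
    d = abs(x - node.center)
    if node.size < theta * d:               # cluster is far away: use its aggregate
        return node.mass / d
    return sum(potential(c, x, theta) for c in node.children)


def direct(points, x):
    """Exact O(n) evaluation of the same sum, for comparison."""
    return sum(m / abs(x - y) for y, m in points if y != x)


rng = random.Random(0)
points = [(rng.random(), rng.uniform(0.5, 1.5)) for _ in range(200)]
root = Node(points, 0.0, 1.0)
x0 = points[0][0]
exact = direct(points, x0)
approx = potential(root, x0, theta=0.2)
print(abs(approx - exact) / exact)          # relative error of the approximation
```

A smaller `theta` opens more clusters and is more accurate but slower; `theta = 0` never accepts a cluster and reduces to the direct sum.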